We are using a two-component hurdle model: first, the model predicts whether a disease will be present (binary), and if present, it predicts the case count (integer). Here we compare the results of a boosted tree model to our baseline model.
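
The two-stage logic described above can be illustrated with a minimal Python sketch. It is not the fitted model from this report: the class name `HurdleModel`, the 0.5 threshold, and the two toy prediction functions are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

Features = Sequence[float]


@dataclass
class HurdleModel:
    """Two-stage predictor: a presence classifier gates a count regressor."""

    predict_presence_prob: Callable[[Features], float]   # stage 1: P(disease present)
    predict_positive_count: Callable[[Features], float]  # stage 2: count given presence
    threshold: float = 0.5  # hypothetical cut-off, not taken from the report

    def predict(self, x: Features) -> int:
        # Stage 1: if the classifier says "absent", the predicted count is zero.
        if self.predict_presence_prob(x) < self.threshold:
            return 0
        # Stage 2: a count conditional on presence is at least 1.
        return max(1, round(self.predict_positive_count(x)))


# Toy stand-ins for the two fitted components (illustrative only).
model = HurdleModel(
    predict_presence_prob=lambda x: 0.9 if x[0] > 0 else 0.1,
    predict_positive_count=lambda x: 10.0 * x[0],
)

print(model.predict([0.0]))  # classifier says absent, so the count is 0
print(model.predict([2.5]))  # classifier says present, so stage 2 supplies the count
```

Keeping the two stages separate is what allows the presence classifier and the count model to be evaluated with different metrics, as in the tables that follow.
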

| .metric | desc | model | full_model |
|---|---|---|---|
| accuracy | proportion of the data that are predicted correctly | baseline | 0.85 |
| accuracy | | xgboost | 0.96 |
| kap | similar measure to accuracy(), but is normalized by the accuracy that would be expected by chance alone and is very useful when one or more classes have large frequency distributions. | baseline | 0.45 |
| kap | | xgboost | 0.88 |
| sens | the proportion of positive results out of the number of samples which were actually positive. | baseline | 0.99 |
| sens | | xgboost | 0.98 |
| spec | the proportion of negative results out of the number of samples which were actually negative | baseline | 0.36 |
| spec | | xgboost | 0.90 |
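
To make the metric descriptions concrete, here is a small Python sketch that computes all four from the cells of a 2x2 confusion matrix. The counts are hypothetical, chosen only to show how a classifier can combine high accuracy and sensitivity with low specificity and a modest kappa when one class dominates; they are not the data behind this report. The function name `classification_metrics` is likewise an assumption.

```python
def classification_metrics(tp: int, fn: int, fp: int, tn: int) -> dict:
    """Accuracy, Cohen's kappa, sensitivity and specificity from a 2x2 confusion matrix."""
    n = tp + fn + fp + tn
    accuracy = (tp + tn) / n
    sens = tp / (tp + fn)  # true positives / all actual positives
    spec = tn / (tn + fp)  # true negatives / all actual negatives

    # Agreement expected by chance: product of the marginal proportions,
    # summed over the positive and negative classes.
    expected = ((tp + fn) / n) * ((tp + fp) / n) + ((tn + fp) / n) * ((tn + fn) / n)
    kap = (accuracy - expected) / (1 - expected)

    return {"accuracy": accuracy, "kap": kap, "sens": sens, "spec": spec}


# Hypothetical counts with an 80/20 class imbalance (not from the report).
metrics = classification_metrics(tp=792, fn=8, fp=128, tn=72)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Because kappa subtracts the agreement expected by chance, it penalises a model that mostly predicts the majority class, which accuracy alone does not reveal.
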

| .metric | model | birds | buffaloes | camelidae | cats | cattle | cervidae | dogs | equidae | hares/rabbits | sheep/goats | swine |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| accuracy | baseline | 0.85 | 0.76 | 0.770 | 0.76 | 0.86 | 0.730 | 0.80 | 0.91 | 0.85 | 0.86 | 0.87 |
| accuracy | xgboost | 0.95 | 0.96 | 0.960 | 0.97 | 0.95 | 0.970 | 0.95 | 0.97 | 0.96 | 0.96 | 0.96 |
| kap | baseline | 0.42 | 0.20 | 0.130 | 0.38 | 0.56 | 0.059 | 0.52 | 0.42 | 0.20 | 0.47 | 0.42 |
| kap | xgboost | 0.84 | 0.91 | 0.890 | 0.94 | 0.88 | 0.920 | 0.91 | 0.87 | 0.86 | 0.89 | 0.88 |
| sens | baseline | 0.98 | 1.00 | 1.000 | 1.00 | 0.99 | 1.000 | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 |
| sens | xgboost | 0.97 | 0.97 | 0.970 | 0.98 | 0.97 | 0.980 | 0.96 | 0.99 | 0.98 | 0.98 | 0.98 |
| spec | baseline | 0.34 | 0.15 | 0.094 | 0.32 | 0.49 | 0.043 | 0.48 | 0.31 | 0.14 | 0.38 | 0.32 |
| spec | xgboost | 0.85 | 0.94 | 0.920 | 0.96 | 0.91 | 0.940 | 0.94 | 0.87 | 0.87 | 0.90 | 0.90 |

| .metric | model | Africa | Americas | Asia | Europe | NA | Oceania |
|---|---|---|---|---|---|---|---|
| accuracy | baseline | 0.84 | 0.82 | 0.85 | 0.87 | 0.94 | 0.930 |
| accuracy | xgboost | 0.95 | 0.96 | 0.96 | 0.95 | NA | 0.990 |
| kap | baseline | 0.48 | 0.38 | 0.47 | 0.46 | 0.44 | 0.120 |
| kap | xgboost | 0.88 | 0.91 | 0.89 | 0.84 | NA | 0.920 |
| sens | baseline | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 | 1.000 |
| sens | xgboost | 0.97 | 0.98 | 0.98 | 0.98 | NA | 1.000 |
| spec | baseline | 0.40 | 0.30 | 0.38 | 0.37 | 0.33 | 0.068 |
| spec | xgboost | 0.91 | 0.93 | 0.91 | 0.85 | NA | 0.920 |

| .metric | desc | model | full_model |
|---|---|---|---|
| accuracy | proportion of the data that are predicted correctly | baseline | 0.850 |
| accuracy | | xgboost | 0.960 |
| kap | similar measure to accuracy(), but is normalized by the accuracy that would be expected by chance alone and is very useful when one or more classes have large frequency distributions. | baseline | 0.052 |
| kap | | xgboost | 0.540 |
| sens | the proportion of positive results out of the number of samples which were actually positive. | baseline | 0.470 |
| sens | | xgboost | 0.590 |
| spec | the proportion of negative results out of the number of samples which were actually negative | baseline | 0.680 |
| spec | | xgboost | 0.810 |

| .metric | model | birds | buffaloes | camelidae | cats | cattle | cervidae | dogs | equidae | hares/rabbits | sheep/goats | swine |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| accuracy | baseline | 0.850 | 0.760 | 0.77 | 0.760 | 0.860 | 0.7300 | 0.800 | 0.910 | 0.850 | 0.860 | 0.870 |
| accuracy | xgboost | 0.950 | 0.960 | 0.96 | 0.970 | 0.950 | 0.9700 | 0.950 | 0.970 | 0.960 | 0.960 | 0.960 |
| kap | baseline | 0.064 | 0.032 | 0.04 | 0.025 | 0.042 | 0.0032 | 0.039 | 0.082 | 0.043 | 0.052 | 0.061 |
| kap | xgboost | 0.390 | 0.670 | 0.66 | 0.770 | 0.510 | 0.7500 | 0.660 | 0.510 | 0.540 | 0.550 | 0.540 |
| sens | baseline | 0.440 | 0.580 | 0.55 | 0.570 | 0.430 | 0.5700 | 0.510 | 0.480 | 0.460 | 0.470 | 0.480 |
| sens | xgboost | 0.530 | 0.620 | 0.63 | 0.700 | 0.570 | 0.6900 | 0.630 | 0.570 | 0.610 | 0.580 | 0.580 |
| spec | baseline | 0.690 | 0.660 | 0.67 | 0.660 | 0.670 | 0.6400 | 0.670 | 0.700 | 0.680 | 0.680 | 0.690 |
| spec | xgboost | 0.760 | 0.860 | 0.86 | 0.900 | 0.800 | 0.9000 | 0.860 | 0.800 | 0.810 | 0.820 | 0.810 |

| .metric | model | Africa | Americas | Asia | Europe | NA | Oceania |
|---|---|---|---|---|---|---|---|
| accuracy | baseline | 0.840 | 0.820 | 0.850 | 0.87 | 0.940 | 0.930 |
| accuracy | xgboost | 0.950 | 0.960 | 0.960 | 0.95 | NA | 0.990 |
| kap | baseline | 0.036 | 0.025 | 0.059 | 0.09 | 0.065 | 0.039 |
| kap | xgboost | 0.540 | 0.590 | 0.550 | 0.49 | NA | 0.530 |
| sens | baseline | 0.450 | 0.470 | 0.470 | 0.46 | 0.430 | 0.610 |
| sens | xgboost | 0.560 | 0.580 | 0.600 | 0.58 | NA | 0.540 |
| spec | baseline | 0.670 | 0.670 | 0.680 | 0.69 | 0.680 | 0.700 |
| spec | xgboost | 0.810 | 0.830 | 0.820 | 0.79 | NA | 0.810 |

Disease status variable importance and partial dependency (xgboost only)

| Feature | Gain | Cover | Frequency |
|---|---|---|---|
| disease_status_lag1 | 0.776430287 | 0.054115316 | 0.033185841 |
| cases_lag1_missing | 0.049485755 | 0.043642648 | 0.025073746 |
| ever_in_country_any_taxa | 0.048458073 | 0.050788831 | 0.019911504 |
| disease_status_lag2 | 0.017900276 | 0.036354726 | 0.020648968 |
| log_human_population | 0.013270873 | 0.044711771 | 0.135693215 |
| cases_lag2_missing | 0.009267981 | 0.006659092 | 0.016961652 |
| disease_population_wild | 0.009183998 | 0.014212483 | 0.007374631 |
| log_gdp_per_capita | 0.009140356 | 0.023998585 | 0.109882006 |
| disease_status_lag3 | 0.006213909 | 0.034683627 | 0.016961652 |
| cases_lag3_missing | 0.005740374 | 0.017600986 | 0.014749263 |
| log_taxa_population | 0.005057817 | 0.020693720 | 0.061209440 |
| cases_lag_sum_border_countries | 0.004211446 | 0.047838273 | 0.044985251 |
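
The Gain, Cover and Frequency columns are each feature's share of a total taken over all splits in the boosted trees. The Python sketch below shows that bookkeeping on a hand-written list of splits. It assumes the usual xgboost definitions: Gain is a feature's share of total split gain, Cover its share of total split cover, and Frequency its share of the number of splits. The split values and the function name `importance` are hypothetical.

```python
from collections import defaultdict


def importance(splits):
    """Aggregate (feature, gain, cover) split records into normalised shares."""
    totals = defaultdict(lambda: {"gain": 0.0, "cover": 0.0, "count": 0})
    for feature, gain, cover in splits:
        totals[feature]["gain"] += gain
        totals[feature]["cover"] += cover
        totals[feature]["count"] += 1

    gain_sum = sum(t["gain"] for t in totals.values())
    cover_sum = sum(t["cover"] for t in totals.values())
    n_splits = sum(t["count"] for t in totals.values())

    rows = [
        {
            "Feature": feature,
            "Gain": t["gain"] / gain_sum,        # share of total loss reduction
            "Cover": t["cover"] / cover_sum,     # share of observations passing through splits
            "Frequency": t["count"] / n_splits,  # share of splits using this feature
        }
        for feature, t in totals.items()
    ]
    # Most informative feature first.
    return sorted(rows, key=lambda r: r["Gain"], reverse=True)


# Hypothetical splits: one high-gain split versus three low-gain splits.
splits = [
    ("disease_status_lag1", 90.0, 500.0),
    ("log_human_population", 4.0, 200.0),
    ("log_human_population", 3.0, 150.0),
    ("log_human_population", 3.0, 150.0),
]
for row in importance(splits):
    print(row)
```

A feature can therefore rank first on Gain while being used in few splits, which is the pattern `disease_status_lag1` shows in the table above.
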

Here we evaluate the case-count component on the subset of the training data with positive case counts.

Cases model stats

| model | .metric | .estimator | .estimate |
|---|---|---|---|
| baseline | rmse | standard | 210751. |
| xgboost | rmse | standard | 260813. |
| baseline | rsq | standard | 0.371 |
| xgboost | rsq | standard | 0.144 |
| baseline | mae | standard | 1787. |
| xgboost | mae | standard | 2295. |
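
For reference, the three regression metrics can be computed as in the Python sketch below. It assumes `rsq` is the squared Pearson correlation between observed and predicted values, which is how yardstick-style output usually defines it; that is an assumption about this report, not something it states. The toy vectors are hypothetical.

```python
import math


def rmse(truth, estimate):
    """Root mean squared error: penalises large errors quadratically."""
    return math.sqrt(sum((t - e) ** 2 for t, e in zip(truth, estimate)) / len(truth))


def mae(truth, estimate):
    """Mean absolute error: every error counts in proportion to its size."""
    return sum(abs(t - e) for t, e in zip(truth, estimate)) / len(truth)


def rsq(truth, estimate):
    """Squared Pearson correlation between truth and estimate (assumed definition)."""
    n = len(truth)
    mean_t = sum(truth) / n
    mean_e = sum(estimate) / n
    cov = sum((t - mean_t) * (e - mean_e) for t, e in zip(truth, estimate))
    var_t = sum((t - mean_t) ** 2 for t in truth)
    var_e = sum((e - mean_e) ** 2 for e in estimate)
    return cov ** 2 / (var_t * var_e)


# Hypothetical case counts with one very large outbreak.
truth = [1, 2, 3, 1000]
estimate = [2, 2, 4, 400]

print("rmse:", rmse(truth, estimate))
print("mae: ", mae(truth, estimate))
print("rsq: ", rsq(truth, estimate))
```

RMSE is never smaller than MAE, and a gap as wide as the one in the table above typically means that a small number of very large errors dominate the squared-error total.
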

Cases variable importance and partial dependency (xgboost only)

| Feature | Gain | Cover | Frequency |
|---|---|---|---|
| log_taxa_population | 0.393012236 | 0.180119640 | 0.062023939 |
| cases_lag1 | 0.200759320 | 0.203609619 | 0.205658324 |
| country_iso3c_IRQ | 0.130015681 | 0.063435733 | 0.005440696 |
| log_gdp_per_capita | 0.039913259 | 0.049233161 | 0.054406964 |
| disease_mycoplasma_infection | 0.036270814 | 0.025614941 | 0.003264418 |
| log_human_population | 0.030280061 | 0.039263981 | 0.079434168 |
| cases_lag2 | 0.027312253 | 0.045814607 | 0.102285092 |
| log_veterinarians_per_taxa | 0.026401802 | 0.052468420 | 0.059847661 |
| country_iso3c_RWA | 0.016300578 | 0.027103687 | 0.009793254 |
| cases_lag_sum_border_countries | 0.008151194 | 0.028439215 | 0.071817193 |
| country_iso3c_VNM | 0.006341258 | 0.015041983 | 0.004352557 |
| cases_lag3 | 0.006169676 | 0.008213044 | 0.068552775 |